Fix setting V1 format version for Non-REST catalogs #411
Conversation
pyiceberg/table/metadata.py
Outdated
schema = data.get("schema")
if schema and isinstance(schema, dict):
    if "schema_id" not in data["schema"]:
        data["schema"]["schema_id"] = DEFAULT_SCHEMA_ID
When I explicitly pass in a schema, it's no longer a dictionary. Then this validator fails, since data["schema"] is an actual Schema and cannot be subscripted. I'm not really familiar with Pydantic, but my understanding of this validator is that it's meant for converting the schema metadata to a common representation that can be shared with V2 metadata. I think that should only happen when reading the actual JSON file.
I'm still working through another issue for partition specs in V1 (this error is thrown in the updated test):
pydantic_core._pydantic_core.ValidationError: 2 validation errors for TableMetadataV1
partition_spec.0
Input should be a valid dictionary [type=dict_type, input_value=('spec_id', 0), input_type=tuple]
For further information visit https://errors.pydantic.dev/2.6/v/dict_type
partition_spec.1
Input should be a valid dictionary [type=dict_type, input_value=('fields', ()), input_type=tuple]
For further information visit https://errors.pydantic.dev/2.6/v/dict_type
I'm not sure why this happens for V1 metadata explicitly constructed but V2 works fine.
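For what it's worth, ('spec_id', 0) and ('fields', ()) are exactly what you get by iterating a pydantic model, so one possible source is a PartitionSpec model ending up somewhere a list of plain dicts is expected. A tiny illustration (SpecLike is a made-up stand-in, not the pyiceberg class):

from pydantic import BaseModel


class SpecLike(BaseModel):
    spec_id: int = 0
    fields: tuple = ()


# Iterating a pydantic model yields (field_name, value) tuples:
print(list(SpecLike()))  # [('spec_id', 0), ('fields', ())]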
Hey @amogh-jahagirdar Yes, this is a bit awkward in Pydantic. This is a @model_validator(mode="before"), which means that the validator is always applied before trying to construct the model. This allows you to correct certain things before the checks are applied. This particular function makes sure that the schema-id is set, since schema_id is required. Keep in mind that the aliases have also been applied at this point, so both schema_id and schema-id are valid.
The before-mode model validator is also applied when you initialize the model directly, e.g. TableMetadataV1(bla=(1,2,3)).
The error that you're seeing occurs after the mode="before" validators are applied. I would put a breakpoint in the validator and trace up the stack to see where the bogus data comes from.
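For reference, a minimal self-contained sketch of that behaviour (ExampleMetadata is a hypothetical stand-in for the real pyiceberg model; it only assumes pydantic v2):

from typing import Any

from pydantic import BaseModel, Field, model_validator

DEFAULT_SCHEMA_ID = 0


class ExampleMetadata(BaseModel):
    model_config = {"populate_by_name": True}

    schema_id: int = Field(alias="schema-id")

    @model_validator(mode="before")
    @classmethod
    def set_default_schema_id(cls, data: Any) -> Any:
        # Runs on the raw input before field validation, so missing values can be patched here.
        # Both the field name and its alias can appear in the raw data.
        if isinstance(data, dict) and "schema_id" not in data and "schema-id" not in data:
            data["schema_id"] = DEFAULT_SCHEMA_ID
        return data


# The validator fires both for raw dicts and for direct construction:
assert ExampleMetadata.model_validate({}).schema_id == DEFAULT_SCHEMA_ID
assert ExampleMetadata(**{"schema-id": 5}).schema_id == 5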
Thanks for fixing this :)
pyiceberg/table/metadata.py
Outdated
data["schema"]["schema_id"] = DEFAULT_SCHEMA_ID | ||
schema = data.get("schema") | ||
if schema and isinstance(schema, dict): | ||
if "schema_id" not in data["schema"]: |
if "schema_id" not in data["schema"]: | |
if "schema_id" not in schema: |
Force-pushed from 8c5f085 to e3f5dcb
Force-pushed from 4d8e486 to d0c964f
Force-pushed from d0c964f to 2354919
Thanks for working on this during the weekend. This looks good, I just have some concerns about setting the schema and partition-spec using the before-validator.
pyiceberg/table/metadata.py
Outdated
@@ -313,6 +315,34 @@ def construct_partition_specs(cls, data: Dict[str, Any]) -> Dict[str, Any]:

        return data

    @model_validator(mode="before")
    def construct_v1_spec_from_v2_fields(cls, data: Dict[str, Any]) -> Dict[str, Any]:
I'd rather avoid using the before-validators to patch in the partition-spec and schema (probably the sort-order as well?). I'd much rather see us explicitly pass in the spec, as below:
if format_version == 1:
    return TableMetadataV1(
        location=location,
        schema=fresh_schema,
        last_column_id=fresh_schema.highest_field_id,
        current_schema_id=fresh_schema.schema_id,
+       partition_spec=fresh_partition_spec,
        partition_specs=[fresh_partition_spec],
        default_spec_id=fresh_partition_spec.spec_id,
+       sort_order=fresh_sort_order,
        sort_orders=[fresh_sort_order],
        default_sort_order_id=fresh_sort_order.order_id,
        properties=properties,
        last_partition_id=fresh_partition_spec.last_assigned_field_id,
        table_uuid=table_uuid,
    )
My biggest concern is again that this is a before validator, and we don't take into account the aliases (partition-specs, partition-spec-id, etc.); see the sketch below.
It looks like Java just ignores the schema if there is a schemas: https://github.com/apache/iceberg/blob/5f577f1b9902ffe6181897a439686a31fc81b89a/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L343-L368
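To make the alias concern concrete, here is a minimal sketch of the pattern a before-validator would need, since it sees the raw keys and has to handle both spellings itself (MetadataLike and its field are hypothetical, not the pyiceberg model):

from typing import Any, Dict, List

from pydantic import BaseModel, Field, model_validator


class MetadataLike(BaseModel):
    model_config = {"populate_by_name": True}

    partition_specs: List[Dict[str, Any]] = Field(default_factory=list, alias="partition-specs")

    @model_validator(mode="before")
    @classmethod
    def default_specs(cls, data: Any) -> Any:
        if isinstance(data, dict):
            # The raw input may use either spelling; checking only "partition_specs"
            # would silently miss input that arrives under the "partition-specs" alias.
            if "partition_specs" not in data and "partition-specs" not in data:
                data["partition_specs"] = [{"spec-id": 0, "fields": []}]
        return data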
Yeah, it just felt a bit awkward to pass in the fields in a redundant way. As long as the source of truth is correct when reading the fields, which it is, as you saw for schemas, then we're good. I've updated.
Also, PartitionSpec on the V1 metadata is a list[dict[str, Any]]; right now I work with that to preserve compatibility, but it does make things a bit awkward, since we have to pass in the model_dump of the partition spec instead of just the normal PartitionSpec.
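A minimal sketch of that shape mismatch (all three classes are made-up stand-ins, not the actual pyiceberg models):

from typing import Any, Dict, List

from pydantic import BaseModel


class PartitionFieldLike(BaseModel):
    source_id: int
    field_id: int
    name: str
    transform: str


class PartitionSpecLike(BaseModel):
    spec_id: int = 0
    fields: List[PartitionFieldLike] = []


class V1MetadataLike(BaseModel):
    # The deprecated V1 field holds plain dicts rather than a spec model.
    partition_spec: List[Dict[str, Any]] = []


spec = PartitionSpecLike(
    spec_id=0,
    fields=[PartitionFieldLike(source_id=1, field_id=1000, name="dt", transform="identity")],
)

# The spec model can't be passed where list[dict] is expected, so it is dumped first:
V1MetadataLike(partition_spec=[f.model_dump() for f in spec.fields])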
Force-pushed from 7b3aa35 to 9fc37fe
Thanks for taking this! @amogh-jahagirdar It's great that we no longer require an extra before validator for supporting the V1 metadata. Setting the relevant fields explicitly is a smart approach.
Thanks, but I don't know about that tbh. I think it's a pretty bad API that we are somewhat forced to specify all the fields even though that's redundant for V1 metadata (because it's deprecated we want to specify the new fields, but partition-spec is still required). We should revisit that imo, but I think that can be done in a follow-on PR, since the goal here is to solve the original problem of not being able to create V1 tables in non-REST catalogs. Also, it doesn't really matter too much in practice, since constructing new metadata is done behind the catalog anyway, which is the API the vast majority of people will be interacting with.
Force-pushed from 9fc37fe to 4dbc2de
LGTM!
I think it's a pretty bad API that we are somewhat forced to specify all fields even though it's redundant for V1 metadata
I guess that's the case with maintaining backward compatibility :(. Fortunately, as you mentioned, this is behind new_table_metadata and the catalog, thereby not requiring users to specify any redundant fields.
I thought pydantic's before validator might help, but as we saw in this PR, it's more effective for parsing raw data than for constructing models directly. So, it looks like our current approach is the best we have right now.
We should revisit that ...
Totally agree! For the time being, this approach should work fine. We may need to revisit it when we need to deprecate some fields or introduce V3 metadata.
Thanks @amogh-jahagirdar for fixing this over the weekend, and @HonahX for jumping in for the review 🙌
Currently we always set the V2 format on create table for non-REST catalogs, regardless of the specified version. This change addresses that.